ABSTRACT

In this paper we review neglected issues of simultaneous statistical
inference and statistical power in survey research applications of
the general linear model, and we find that classical hypothesis
testing, as it is currently applied, is inadequate for the purposes of
social research. The intelligent use of statistical inference demands
control over the overall level of Type I error and knowledge of the
magnitude of effects one is likely to detect. We suggest techniques
that can be used to routinely incorporate considerations of
simultaneous inference and power into the statistical analysis of
survey data. Several examples of applications of these techniques are
presented.

I. Introduction

Our purpose in this paper is to provide for a more informed use of
statistical inference in tests of hypotheses in survey applications
of the general linear model (GLM). This model, like any model, is
comprised of a set of assumptions that permit the derivation of
certain general principles. The assumptions of the GLM are of
particular utility to the survey researcher in that they permit one to
draw inferences about the structure of relationships among variables
in larger populations on the basis of sample survey data. In this
paper we suggest that when conducting multiple statistical tests of
hypotheses within the GLM framework, our results will be more
meaningful if we know the overall probability of rejecting a true null
hypothesis and the probability of finding statistically significant
results when substantively meaningful effects exist.

By following the current practice for doing multiple tests of
hypotheses on parameters of linear models, researchers are
inadequately controlling the probability of rejecting a true null
hypothesis, that is, the probability of making a Type I error.
Inference considerations in situations where multiple tests of
hypotheses are conducted are qualitatively different from the
procedures described for single hypothesis testing in most texts.
Procedures currently employed yield nominal Type I error rates that
can be considerably lower than the true probability of rejecting a
true null hypothesis when a number of hypotheses are tested.
Scientific norms of parsimony that dictate researchers be
conservative in their claims of the empirical effects of social
variables are clearly violated as social researchers systematically
underestimate the likelihood that Type I errors are occurring in their
analyses. Drawing on a considerable body of literature on simultaneous
inference in the GLM, we argue that analysts of survey data must
reconceptualize their approach to statistical inference. We discuss
several techniques for treating the simultaneous inference problem in
the GLM, and present, with examples, procedures for applying these
techniques to survey data.

In current research practice little concern has been expressed for
the power of GLM statistical tests. Analysts of large sample survey
data often dismiss power considerations with the assertion that their
tests have "more than enough power": even trivial effects yield
statistically significant results (Blau and Duncan, 1967, pp. 17-18).
We present examples below to demonstrate that for many GLM
hypotheses, whether or not the tests are characterized by "more than
enough power" can be quite problematic. Indeed, most researchers are
confronted with a situation where they must take as given two
important determinants of the power of statistical tests, sample size
and the configuration of independent variables.^1 We argue that it is
imperative that the analyst compute the magnitudes of the effects that
are likely to be detectable in a given set of data. As we shall
demonstrate below, the power of GLM tests can be routinely calculated.

The informed use of statistical tests of hypotheses requires that
the survey researcher be aware of the major issues and procedures in
simultaneous statistical inference and power. In this paper we
provide a critical examination of issues and procedures in
simultaneous statistical inference and statistical power as they apply
to survey research. We begin our exegesis with a brief review of the
assumptions of the GLM and common hypothesis tests in survey research
applications. Drawing on the body of literature on simultaneous
inference in the GLM, we argue that analysts of survey data must
reconceptualize their approach to statistical inference. We discuss
several techniques for treating the simultaneous inference problem in
the GLM, and present, with examples, procedures for applying these
techniques to survey data. Following the treatment of simultaneous
inference we examine factors that influence the ability to detect
substantively meaningful effects, that is, statistical power. A
procedure for estimating the power of statistical tests is discussed
and illustrative examples of the influence of various factors on
statistical power are presented. We conclude with some suggestions for
improving the use of statistical inference in making meaningful
decisions about the merits of the hypotheses being tested.

II. The General Linear Model: Assumptions and
Common Hypothesis Tests 

A. GLM Assumptions 

We shall concern ourselves with tests of hypotheses about the param- 
eters of the GLM, the classical model stated in matrix terms as follows: 

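(1) $y = X\beta + \varepsilon$, where $y$ is $(N \times 1)$, $X$ is $(N \times K)$, $\beta$ is $(K \times 1)$, and $\varepsilon$ is $(N \times 1)$

(2) $E(\varepsilon) = 0$

(3) $E(\varepsilon\varepsilon') = \sigma^2 I$

(4) $\varepsilon \sim N(0, \sigma^2 I)$

(5) $X$ is fixed (nonstochastic) and of full column rank.^2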

While the theory of statistical inference for the GLM was originally
developed for the above model, assumption (5), fixed $X$, is clearly
untenable in the application of the model to survey data. It requires
that the sampling design include a priori stratification on all
independent variables, i.e. a priori specification of cell sizes for
each combination of the levels of the independent variables. A
modification of the above model allows for the typical survey design
of multivariate sampling from the joint distribution of $y$ and $X$.
We replace (2) through (5) above with the following assumptions:

(2a) $E(\varepsilon \mid X) = 0$

(3a) $E(\varepsilon\varepsilon' \mid X) = \sigma^2 I$

(4a) $\varepsilon \mid X \sim N(0, \sigma^2 I)$.

Thus it is required that the classical assumptions hold conditionally
on $X$.^3 The disturbance must be mean independent of the independent
variables and be conditionally independently normally distributed with
zero mean and constant variance. While this more appropriate
conditional GLM presents no differences in the treatment of Type I
error, it does complicate the treatment of power. While our results
with respect to Type I error hold unconditionally, the procedures for
power calculations presented in this paper give results conditional
upon the values of $X$ realized in a particular sample (Graybill,
1961, pp. 204-205; Sampson, 1974). The conditional power calculations
presented herein must be considered upper bounds upon the
unconditional power of the tests.^4

B. Hypothesis Testing in Survey Applications of the GLM

In Table 1 we present an outline of the types of GLM hypotheses
commonly tested in survey applications and the statistical tests
applied to those hypotheses. In (1) we have the test of an individual
coefficient, $\beta_k$. The t-test is just $b_k - \beta_k^*$ divided
by the standard error of $b_k$, the usual t-ratio computed in
regression programs. The one degree of freedom F-test is merely the
square of the t-test.

Hypothesis (2) is the test that a subset of J coefficients are
jointly equal to a set of J specified values. When $\beta_{(2)}^*$ is
specified to be a vector of zeros and J = K - 1, it is the common
"overall" F-test of no regression. When J < K - 1 and
$\beta_{(2)}^*$ is a vector of zeros, it is the "increment to $R^2$"
F-test for a subset of variables.

Hypotheses (1) and (2) comprise the majority of hypotheses tested
in survey applications of the GLM. Although seldom conducted in
nonexperimental applications of the GLM, a researcher may want to test
whether linear combinations of the coefficients are equal to some
specified zero or nonzero values. Paralleling hypotheses (1) and (2),
one can test a single linear combination with a t-test^ or one degree
of freedom F-test, or jointly test J linearly independent linear
combinations of coefficients. Indeed, hypotheses (1) and (2) are
special cases of (3) and (4).

Finally, each of the F-tests for the hypotheses can be considered
"increment to $R^2$" tests with J numerator and N - K denominator
degrees of freedom as given in the equation for the u statistic found
in Table 1.

III. Simultaneous Statistical Inference

In most applications of the GLM in survey research more than one of
a single type of the above delineated hypotheses is tested or more
than one type of hypothesis is tested. Frequently some set of
interaction effects are tested jointly and main effects are tested
individually. When the effects of a categorical and one or more
continuous independent variables are analyzed, often a joint test of
the effects of the set of dummy variables representing the categorical
variable and one or more individual tests of the coefficients for the
continuous variables are performed. In applications of the GLM that
involve a single equation, the performance of multiple t-tests on
individual slope coefficients is a universal practice. Finally, it is
becoming standard practice to do multiple tests on all possible slope
coefficients in simple recursive structural equation models.

We have briefly noted above that the researcher who performs such
multiple hypotheses tests is in a qualitatively different inference
situation, that of simultaneous statistical inference, than the
researcher who performs only a single hypothesis test. In this section
we shall consider both how the single and multiple hypotheses cases
differ from the standpoint of inference, and techniques of statistical
inference that are appropriate to the multiple hypotheses test
situation. First we shall examine these two issues in general and then
we shall consider them as they apply to the standard tests conducted
within survey research applications of the GLM.

A. General Issues in Simultaneous Statistical Inference

An understanding of the basic difference between the single
hypothesis and multiple hypotheses cases can best be achieved by first
recalling the definition of Type I error in statistical inference.
Type I error is the error of falsely rejecting a true null hypothesis.
For the researcher who performs only a single null hypothesis test,
this definition presents no problem. The probability that he will
falsely reject this single null hypothesis is the probability of Type
I error in this case. The researcher can straightforwardly proceed by
following the suggested standard procedure of specifying a level,
$1 - \alpha$, of protection against a Type I error, and then perform
his statistical test accordingly. Now consider what happens if this
same researcher sometime in his life performs additional tests of null
hypotheses according to the suggested standard procedure. That is, he
specifies a level $1 - \alpha$ of desired protection against a Type I
error and conducts each of his statistical tests at this level.

If we reflect now on the definition of Type I error we realize that
for this researcher the actual level of protection against making a
Type I error in the multiple null hypotheses case is less than
$1 - \alpha$. Thus, this researcher is overestimating the protection
he has against falsely rejecting a true null hypothesis. The problem
with employing the conventional procedure for making tests of multiple
null hypotheses arises because the probability of making a Type I
error in this case is the probability of falsely rejecting any one of
the individual null hypotheses, which equals the probability of making
a Type I error for the first null hypothesis, or for the second null
hypothesis, or for the nth null hypothesis, or for any combination of
the n hypotheses. Except in the case of total dependency among the
null hypotheses tested, this probability is greater than the $\alpha$
level under which each of the null hypotheses was tested.
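
To give a sense of how quickly this probability grows, the following is a minimal Python sketch. It assumes the tests are statistically independent and that every null hypothesis is true; neither assumption is made in the text, which states only that the group probability exceeds $\alpha$ except under total dependency. The function name is hypothetical.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false rejection among n_tests tests.

    Assumes (for illustration only) that the tests are independent and
    that all null hypotheses are true, so each test falsely rejects
    with probability alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    # P(no false rejection in any test) = (1 - alpha) ** n_tests
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Group error rate for several numbers of tests, each run at alpha = .05
    for n in (1, 5, 10, 20):
        print(n, familywise_error_rate(0.05, n))
```

Under these assumptions, ten tests each conducted at $\alpha = .05$ carry a probability of about .40 of at least one Type I error.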

Essentially the solution to this problem is provided by the research- 
er's decision concerning which null hypotheses will be grouped together 
for the purpose of considering Type I error — generally referred to as the 
specification of the unit of error rate. Given this decision, the research- 
er can proceed to do each of the tests of individual null hypotheses in 
such a fashion that the desired level of protection against making a Type 
I error for the group of null hypotheses has been provided. 

Two extreme groupings of null hypotheses can be identified. First,
one could consider as a group all the null hypotheses tests that a
researcher or a group of researchers will do in his or their lifetime.
By grouping in this manner the researcher would be provided with
protection at a specified level against ever falsely rejecting a true
null hypothesis. Second, one could consider each individual null
hypothesis test as a group for inference purposes. This grouping is
generally called a per-comparison unit of error rate and effectively
removes one from the simultaneous inference situation. The first
extreme grouping essentially has been rejected in discussions of
appropriate units of error rate for research, and a general agreement
exists that the upper bound for grouping purposes is provided by the
group of null hypotheses tested by one researcher in one study.
However, there exists no consensus on what is the most appropriate
unit of error rate below this upper bound (Ryan, 1959, 1962; Wilson,
1962; Miller, 1966).

The problem addressed by simultaneous statistical inference
techniques is that of how to perform tests of individual hypotheses
such that one has protection at a specified level against making a
Type I error for a group of hypotheses. Numerous techniques of
simultaneous statistical inference have been designed to address this
problem (Miller, 1966; Kirk, 1968), many of which have been developed
for specific types of tests within the GLM framework.^ However, two
techniques, the Bonferroni and Scheffé, are of wide generality.

The Bonferroni technique is based on the Bonferroni inequality,
which states that

(6) $\alpha_G \le \sum_{i=1}^{N} \alpha_i$,

where $\alpha_G$ equals the significance level for a group of null
hypotheses, $\alpha_i$ equals the significance level for each
individual null hypothesis in the group, and N equals the total
number of null hypotheses in the group. The Bonferroni technique can
be applied to virtually all situations of multiple hypotheses tests
where one has prior knowledge of how many tests are to be conducted.

The Scheffé method, in the GLM framework, provides a means of
controlling error rate for tests of all possible linear combinations
of the least squares estimates of the slope coefficients. The Scheffé
technique is based on the common assumptions (distributional and
otherwise) of the GLM. Its generality is due to the fact that it
allows the researcher to perform an infinite number of tests of linear
combinations of $\beta$ coefficients while protecting against a
specified value of Type I error for the group.

B. Simultaneous Statistical Inference in GLM Applications in Survey Research



The first question that must be addressed is, "Does one need to be
concerned with issues of simultaneous inference?" The implicit answer
given to this question in survey research applications of the GLM to
date has been, "no." Virtually all analyses of survey data conducted
within the GLM framework have implicitly employed a per-comparison
error rate. In general, analyses of multiple null hypotheses based on
survey data have been performed in the following manner: a value of
Type I error is specified and this value is used in tests of each
individual null hypothesis. No consideration is given to error rate
for any group of hypotheses.

There are compelling reasons for believing that this implicit answer
is insufficient. The first reason is that the implicit answer is
usually based on a lack of knowledge of the issues in simultaneous infer- 
ence. Basic textbook treatments of statistical inference, from which 
most social researchers' knowledge of this subject is obtained, generally 
ignore simultaneous statistical inference. Consequently, many social 
researchers are unaware or vaguely aware that a problem may exist in 
doing multiple tests of null hypotheses. 

Beyond this lack of knowledge there are important substantive reasons 
for considering simultaneous inference. Perhaps the most important of 
these is the fact that social researchers do not limit their concern to 
the determination of whether or not a single variable has a statistically 
significant direct effect on a given dependent variable, but extend 
their interest to the determination of whether or not a set of independent 
variables affects a given dependent variable. Such analyses are
frequently done in the context of a causal model of a process that
determines variation in a dependent variable. This is particularly the
case in analyses done within the recursive structural equation
framework. Here the researcher frequently begins with a specified
causal ordering among a set of variables and a set of null hypotheses
about the relationships among this set of variables. It can be argued
that since the researcher is interested in finding the correct model
of a process in a population, the set of multiple null hypotheses used
to find this model should be tested simultaneously. That is, the
researcher should provide protection against a specified level,
$\alpha_G$, of finding an incorrect model of a process, since falsely
rejecting any of the multiple null hypotheses is in effect finding an
incorrect model of the process.

Another reason for being concerned with units of error rate other
than the per-comparison unit is the scientific dictum of conservatism
and parsimony. It is generally thought that the acceptance of a false
null hypothesis is more desirable scientifically than the rejection of
a true null hypothesis. Such a principle, it is proposed, keeps the
scientific literature from becoming unduly confused by false research
findings and keeps scientific theories from becoming overly complex.
Since the employment of a unit for error rate other than the
per-comparison unit makes it more difficult to capitalize on chance in
conducting tests of multiple null hypotheses, the scientific dictums
of conservatism and parsimony argue for the use of simultaneous
inference techniques.

A final reason is specific to the practice of "data snooping" or
"data dredging." Both terms are used in reference to the practice of
doing some previously unspecified number of tests within a body of
data in an attempt to discover relationships among the set of
variables analyzed. Such a practice is undertaken either because the
researcher has no prior hypotheses about the relationships among a set
of variables or wishes to supplement an analysis of prior hypotheses.
In this situation it is argued that since one approaches an analysis
with an unspecified number of null hypotheses to be tested (the number
tested could be one, several, or all possible tests), the
scientifically honest procedure is to use a simultaneous inference
technique that provides protection against a level of Type I error for
all possible tests.

Given that one has concluded that it is desirable to employ
simultaneous statistical inference techniques, the next question that
must be addressed is "What unit of error rate should be employed?" The
most straightforward answer that can be given to this question is
simply that there are no hard and fast rules. The unit of error rate
used is dependent upon the researcher's judgment of what unit best
suits the research purposes. We can make suggestions, however, about
what seem to be appropriate units for certain applications of the
common hypotheses tests delineated in Table 1. We will consider three
such applications: (1) the prediction situation, (2) the use of
"theory trimming" in simple recursive structural equation models, and
(3) the use of various hypothesis tests in post hoc analyses of linear
models.

Consider first the situation in which the researcher is simply
interested in determining which variables among a set of independent
variables have significant, direct effects on a dependent variable.
The intent here is usually that of discovering what variables are
important determinants of variation in some dependent variable. The
usual procedure in this case is the performance of an individual test
of the hypothesis that $\beta_k$ equals zero for each independent
variable. We suggest that all of the individual hypothesis tests of
the $\beta$'s be grouped together for purposes of considering error
rate. Such a grouping seems appropriate since the focus of this type
of research is on the correct prediction of values of a given
dependent variable. By grouping in this fashion, the researcher is
protected at the level $1 - \alpha_G$ against making a Type I error in
predicting values of a given dependent variable.

Secondly, consider the "theory trimming" strategy (Heise, 1969)
often employed in the analysis of simple recursive structural equation
models. A common procedure in social research is the specification of
a recursive causal ordering among a set of variables and the
employment of multiple t-tests of individual $\beta$ coefficients (or
their standardized counterparts) to determine which effects among
those possible in a recursive causal ordering are significant.^ The
intent here is usually that of determining the most plausible model of
some process in a population. We propose that all of the null
hypotheses tested in the "theory trimming" process be considered as a
group for error rate purposes. Because in this case the researcher is
interested in finding the correct model of a process in a population,
grouping in this fashion is appropriate. Since falsely rejecting any
one of the null hypotheses about the individual $\beta$ coefficients
means that the researcher has found an incorrect model of the process,
protection should be provided against falsely rejecting any one of the
null hypotheses. By treating all of the null hypotheses as a unit for
purposes of considering Type I error the researcher does so at the
level $1 - \alpha_G$.

Finally, consider the use of the common hypotheses tests in the
post hoc case. Frequently the results of one's analysis do not conform
to the original expectations. The attainment of unexpected results may
at least partially be attributed to initial assumptions not holding.
For example, one may have assumed that the relationships among a set
of variables are linear and additive when in fact they are nonlinear
or not additive. Many of these assumptions are testable with the data
at hand and in such a situation the researcher may wish to perform a
number of hypothesis tests to determine if the unexpected results are
attributable to the failure to meet one's assumptions.

Additionally, the results of one's analysis may suggest further
tests that may be interesting to the researcher, or the researcher may
simply wish to snoop around in the data in the hope of discovering an
interesting result. In all these post hoc analyses the researcher is
"data dredging." For the reasons of scientific honesty elaborated
above we suggest that the researcher employ as the unit of error rate
all possible tests of the $\beta$ coefficients and employ the Scheffé
technique.

After one has determined the unit of error rate to be employed, a
simultaneous inference technique must then be chosen. The Bonferroni
and Scheffé techniques can both be applied to all of the common
hypotheses tests performed on slope coefficients listed in Table 1.
The two techniques do differ, however, in the advantages each presents
in specific situations. Two criteria are of importance in weighing the
relative advantages of each technique: (1) the ability the techniques
present to detect specific alternative hypotheses (i.e. statistical
power), and (2) their applicability to a priori versus post hoc
statistical tests. We shall first consider the mechanics of applying
each technique and then weigh their relative advantages in terms of
these two criteria.

C. The Bonferroni Technique

To provide protection at level $1 - \alpha_G$ for a group of null
hypotheses via the Bonferroni technique one first determines the total
number of individual null hypotheses to be tested, m. Then, one
divides $\alpha_G$ by m and tests each individual null hypothesis with
a significance level equal to $\alpha_G / m$. For example, if one
wishes to test whether a subset of coefficients are jointly equal to
zero and to test whether four additional coefficients are individually
equal to zero, with a group error rate of .05, one would simply
conduct the tests corresponding to hypotheses (1) and (2) in Table 1,
each with $\alpha_i$ equal to .01 (that is, .05 divided by the m = 5
tests).
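
The arithmetic of the Bonferroni technique can be sketched in a few lines of Python. The function names and the p-values below are hypothetical and serve only to illustrate the rule of testing each of m hypotheses at $\alpha_G / m$; they are not taken from any analysis in this paper.

```python
from typing import List


def bonferroni_per_test_alpha(alpha_group: float, m: int) -> float:
    """Significance level for each of m tests so that the group rate is at most alpha_group."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return alpha_group / m


def bonferroni_reject(p_values: List[float], alpha_group: float) -> List[bool]:
    """Reject an individual null hypothesis when its p-value is below alpha_group / m."""
    alpha_i = bonferroni_per_test_alpha(alpha_group, len(p_values))
    return [p < alpha_i for p in p_values]


# One joint test (hypothesis 2) plus four individual tests (hypothesis 1): m = 5.
alpha_i = bonferroni_per_test_alpha(0.05, 1 + 4)

# Hypothetical p-values, for illustration only.
decisions = bonferroni_reject([0.004, 0.03, 0.012, 0.0009, 0.2], 0.05)
```

Here `alpha_i` is .01, so only the first and fourth of the hypothetical p-values lead to rejection, even though four of the five fall below the per-comparison level of .05.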

D. The Scheffé Technique

The Scheffé technique is applied to the various F-tests specified in
Table 1. It requires that one first perform a test of the joint null
hypothesis that all of the coefficients from which subsequent tests
of the nature of those in Table 1 will be conducted equal zero, with
$1 - \alpha_G$ equal to the desired level of protection against a
Type I error. If this test is nonsignificant one stops here and
performs no further tests, since tests of any linear combination of
these $\beta$ coefficients (this includes as well hypotheses (1) and
(2) in Table 1) will prove nonsignificant.
If, on the other hand, one can reject this joint null hypothesis,
then one carries out any and all of the tests in Table 1 by using as
the critical value of the test statistic for all individual null
hypotheses tests the quantity

$J \, F_{\alpha_G}(J, N - K)$,

where J equals the degrees of freedom from the joint null hypothesis
test that all coefficients equal zero. For example, if the researcher
wishes to conduct individual tests of three different linear
combinations of four $\beta$ coefficients (tests of the form of
hypothesis (3) in Table 1) one would use the specified test in Table 1
with the critical value of the test statistic equal to

$4 \, F_{\alpha_G}(4, N - K)$.
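
As a rough illustration, the Scheffé critical value can be computed with SciPy. This sketch assumes the standard form of the Scheffé criterion for a one degree of freedom F statistic, namely J times the upper $\alpha_G$ point of the central F distribution with J and N - K degrees of freedom; the function name and the value N - K = 1000 are hypothetical.

```python
from scipy.stats import f


def scheffe_critical_value(alpha_group: float, J: int, denom_df: int) -> float:
    """Scheffe critical value for a one degree of freedom F statistic.

    Assumes the standard Scheffe form: J times the upper alpha_group
    quantile of the central F distribution with (J, denom_df) degrees
    of freedom, where J is the numerator degrees of freedom of the
    preliminary joint test.
    """
    return J * f.ppf(1.0 - alpha_group, J, denom_df)


# Tests on linear combinations of four coefficients (J = 4), with a
# hypothetical N - K = 1000 denominator degrees of freedom.
scheffe_crit = scheffe_critical_value(0.05, 4, 1000)

# The unadjusted (per-comparison) critical value for the same statistic.
per_comparison_crit = f.ppf(0.95, 1, 1000)
```

The Scheffé value is considerably larger than the per-comparison value, which is the price paid for protection over all possible linear combinations.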

E. Bonferroni and Scheffé Procedures: Some Comparisons

Note first that the Bonferroni and Scheffé procedures are
conservative. In general the actual value of the group error rate will
be less than that desired. Hence, one will have greater protection
against a Type I error than initially specified. The Bonferroni
procedure, as we have mentioned, is based on the Bonferroni
inequality, so it can readily be seen that it produces only an
approximation to the actual group error rate. The Scheffé technique
provides an exact value of the group error rate for all possible
linear combinations. However, it is only a finite subset of these
linear combinations that is ever tested, and consequently it, like the
Bonferroni technique, is conservative.

The fact that the Scheffé procedure provides an error rate for all
possible linear combinations, while the Bonferroni procedure is based
on a finite number of tests, provides some insight into the ability of
each to allow the rejection of individual null hypotheses.
Intuitively, it appears that the Scheffé technique will be less
powerful than the Bonferroni technique because the former is based on
an infinite number of tests while the latter is not. In fact the
Scheffé technique will always be less powerful for the rejection of
individual null hypotheses when m, the actual number of tests made, is
less than or equal to J, the degrees of freedom for the numerator of
the F statistic. On the other hand, when m is considerably bigger than
J, the inexactitude of the Bonferroni procedure is such that the
Scheffé procedure provides greater power for the rejection of
individual null hypotheses (Miller, 1966, pp. 62-63; Dunn, 1959).
Rather than rely on the somewhat sparse literature comparing the power
of simultaneous inference techniques, the researcher can apply both
the Bonferroni and Scheffé techniques (or others) and use the one that
provides greatest power. Doing so is fully permissible in the a priori
case since the choice of technique is independent of the data
collected.
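
One way to carry out this comparison in the a priori case is to compute both critical values for a one degree of freedom F statistic and keep the smaller. The sketch below assumes the standard forms of the two criteria (Bonferroni: the upper $\alpha_G / m$ point of F with 1 and N - K degrees of freedom; Scheffé: J times the upper $\alpha_G$ point of F with J and N - K degrees of freedom). The function names and the degrees of freedom used are hypothetical.

```python
from typing import Tuple

from scipy.stats import f


def bonferroni_critical_f(alpha_group: float, m: int, denom_df: int) -> float:
    """Critical value of a one-df F statistic when m tests share alpha_group."""
    return f.ppf(1.0 - alpha_group / m, 1, denom_df)


def scheffe_critical_f(alpha_group: float, J: int, denom_df: int) -> float:
    """Scheffe critical value of a one-df F statistic (standard form assumed)."""
    return J * f.ppf(1.0 - alpha_group, J, denom_df)


def more_powerful_technique(
    alpha_group: float, m: int, J: int, denom_df: int
) -> Tuple[str, float]:
    """Return the technique with the smaller critical value, and that value.

    The smaller critical value gives the more powerful test of each
    individual null hypothesis at the same group error rate.
    """
    b = bonferroni_critical_f(alpha_group, m, denom_df)
    s = scheffe_critical_f(alpha_group, J, denom_df)
    return ("Bonferroni", b) if b <= s else ("Scheffe", s)


# Few planned tests relative to J: m = 3, J = 4.
few_tests = more_powerful_technique(0.05, 3, 4, 1000)

# Many tests relative to J: m = 100, J = 2.
many_tests = more_powerful_technique(0.05, 100, 2, 1000)
```

With m = 3 and J = 4 the Bonferroni critical value is the smaller of the two, while with m = 100 and J = 2 the Scheffé value is smaller, in line with the pattern described above.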

The Scheffé procedure presents an advantage over the Bonferroni
technique in post hoc tests. For reasons of scientific honesty
elaborated earlier, the Scheffé technique is more suitable for
searching one's data in an attempt to discover the nature of the
relationships among a set of variables.

We turn now to an examination of issues concerning the power of
statistical tests of GLM hypotheses. Before doing so, an important
implication of the conservative nature of the Bonferroni and Scheffé
procedures for the estimation of the power of hypothesis tests must be
noted. As we shall discuss below, the smaller the $\alpha$ level for a
hypothesis test, the less the power of that test (holding other
factors constant). As a consequence, the ease of applicability of the
two procedures is purchased at the cost of an overestimate of the
power of a test (where the degree of overestimate depends on the size
of the discrepancy between the conservative $\alpha$ and the true
$\alpha$). This fact, together with the conditional nature of the
power calculations noted above, makes it imperative that we stress
that the power calculations to be presented below should be taken as
yielding absolute minimum Type II error rates.

IV. Power

Once appropriate statistical tests and a Type I error rate have been
selected, calculating power (the probability of rejecting a false
null hypothesis) for any test requires that we determine the
probability that the test statistic for the hypothesis exceeds the
critical value when a given alternative hypothesis is true. The GLM
test statistic u is distributed as a noncentral F when an alternative
hypothesis is true, with J and N - K degrees of freedom and
noncentrality parameter $\delta^2$. The distribution has the property
that the probability of the statistic u exceeding a given critical
value (and consequently power) increases monotonically with
$\delta^2$. The noncentrality parameter is a function of (among other
things) the degree to which the null hypothesis is false. For the most
general GLM test, the test of a set of J linear combinations of
coefficients, the null hypothesis is

(7) $H_0\colon A\beta = A\beta^*$

and the noncentrality parameter, $\delta^2$, is

(8) $\delta^2 = \dfrac{(A\beta - A\beta^*)' \, (A(X'X)^{-1}A')^{-1} \, (A\beta - A\beta^*)}{\sigma^2_{y \cdot x}}$

where $\sigma^2_{y \cdot x}$ is the disturbance variance $\sigma^2$
of the model.

2 

Figure 1 presents a plot of power as a function of \(\delta^2\) for
various combinations of Type I error rates and numerator degrees of
freedom, and arbitrarily large denominator degrees of freedom. It can be
seen that for a given \(\alpha\) and \(\delta^2\), power decreases with
numerator degrees of freedom J, and that for given J, power is
monotonically related to the probability of a Type I error. All three of
these relationships are important when we consider the implications of
simultaneous inference for power.
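
The three relationships read from Figure 1 can also be checked
numerically. What follows is only a minimal sketch, not the program used
for this paper. It assumes arbitrarily large denominator degrees of
freedom, in which case J times the F statistic behaves as a noncentral
chi-square with J degrees of freedom and noncentrality \(\delta^2\). The
function names (chi2_cdf, chi2_crit, glm_power) are illustrative and do
not come from the original text.

```python
import math

def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_crit(alpha, df):
    """Upper-alpha critical value by bisection (assumes the answer lies in [0, 200])."""
    lo, hi = 0.0, 200.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def glm_power(delta2, J, alpha, terms=200):
    """Power of a J-df test with noncentrality delta2, infinite denominator df.

    The noncentral chi-square is written as a Poisson(delta2 / 2) mixture of
    central chi-squares with J + 2j degrees of freedom; 200 terms is ample for
    the moderate delta2 values plotted in power charts.
    """
    crit = chi2_crit(alpha, J)
    lam = delta2 / 2.0
    weight = math.exp(-lam)
    power = 0.0
    for j in range(terms):
        power += weight * (1.0 - chi2_cdf(crit, J + 2 * j))
        weight *= lam / (j + 1)
    return power

# Power at a fixed noncentrality for several numerator df and alpha levels.
for J in (1, 3, 5):
    for alpha in (0.05, 0.01):
        print(J, alpha, round(glm_power(10.0, J, alpha), 3))
```

With \(\delta^2 = 0\) the function returns \(\alpha\) itself, which is a
convenient sanity check on any implementation of this kind.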

In Table 2 we present alternative expressions for the noncentrality
parameters for the test of an individual coefficient and the joint test
of J (\(J \le K - 1\)) coefficients. The noncentrality parameters are
presented as functions of the original GLM parameters and standardized
parameters.^11 Looking first at the test of the kth individual
coefficient, we see immediately that \(\delta^2\) (and therefore power)
increases with the degree to which \(\beta_k\) departs from its
hypothesized value \(\beta_k^*\). Noting that \(x_k' M^* x_k\) is just
the sum of squared residuals for the regression of the kth independent
variable on the remaining K - 2 independent variables, we conclude also
that power increases with the orthogonality of the kth independent
variable to the others. We see this again in the \((1 - R_k^2)\) term of
the standardized expression, and note also that, of course, power
increases with sample size. From the standardized expression we also see
that power increases with the proportion of variance explained (i.e., as
the residual variance \(\sigma^2\) decreases). None of these results
should be surprising. The ability to reject a false null hypothesis
increases with the degree to which it is false, the degree to which the
effect being tested is nonredundant with the effects of other
parameters, the amount of data available, and the overall power of the
linear model.

The \(\delta^2\) parameter for the joint test of J (\(J \le K - 1\))
coefficients can be interpreted as a multivariate extension of the
single coefficient case. Both weighting matrices (in the unstandardized
and the standardized expressions) are measures of the degree to which
the covariation among the J variables with coefficients being tested is
orthogonal to the covariation among the remaining K - J - 1 independent
variables, and \(\beta - \beta^*\) is the vector of discrepancies
between the true values of the J coefficients and their hypothesized
values. Indeed, for all GLM tests we can conceptualize the noncentrality
parameter as a scalar measure of the degree to which the null hypothesis
is false, weighted by the amount of independent information available.

Given a substantively meaningful alternative hypothesis one would wish
to be able to detect, the configuration of independent variables in the
model, and the completeness of the model as measured by the proportion
of variance explained, it is a trivial matter to program a computer to
compute \(\delta^2\) as expressed in equation (8) for any general linear
hypothesis and any specified alternative. Thus given \(\delta^2\) and n,
one has enough information to determine the power of the test from
Pearson and Hartley charts (Scheffé, 1959: 438-445; Kirk, 1968:
520-547). We now present examples of calculations of power as functions
of n and the degree to which the null hypothesis is false.
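
As an illustration of how little programming equation (8) requires, the
following sketch computes \(\delta^2\) for an arbitrary hypothesis matrix
A. It is not the Fortran program mentioned later in this section. The
helper names and the tiny three-observation design in the usage lines
are hypothetical, and plain Python lists are used in place of a matrix
library only to keep the example self-contained.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse(M):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(M)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def noncentrality(X, A, beta, beta_star, sigma2):
    """delta^2 of equation (8): d' [A (X'X)^-1 A']^-1 d / sigma2, d = A(beta - beta*)."""
    xtx_inv = inverse(matmul(transpose(X), X))
    middle = inverse(matmul(matmul(A, xtx_inv), transpose(A)))
    d = matmul(A, [[b - bs] for b, bs in zip(beta, beta_star)])
    return matmul(matmul(transpose(d), middle), d)[0][0] / sigma2

# Hypothetical design: an intercept and one regressor taking values -1, 0, 1.
X = [[1, -1], [1, 0], [1, 1]]
# Test of the slope alone: true slope 0.5 against a hypothesized slope of 0.
print(noncentrality(X, [[0, 1]], [2.0, 0.5], [2.0, 0.0], sigma2=1.0))
```

For the single-coefficient case the result equals the squared
discrepancy times the residual sum of squares of that regressor on the
others, divided by the error variance, which is the interpretation given
in the text for Table 2.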

A. Determining the Power of GLM Tests: Some Examples

Is it indeed the case that survey researchers are typically confronted
with testing situations where they have "too much" power, i.e., is it
usually true that trivial departures from the null hypothesis result in
statistically significant tests? It is impossible to answer this question
in general without data; however, our calculations presented below show
that having too much power is by no means generally the case.
Furthermore, we argue that if a researcher is to report the results of
statistical tests, it is always imperative that the magnitude of the
effects, trivial or nontrivial, which are likely to be detected also be
presented.

Consider the following linear model:

(9)  \( y_i = \beta_1 + \beta_2 X_{i2} + \beta_3 X_{i3} + \beta_4 X_{i4} + \beta_5 X_{i5} + \beta_6 X_{i6} + \varepsilon_i \)

where:

y = Income,
X_2 = Education,
X_3 = Occupation,
X_4 = Parental Income,
X_5 = Father's occupation,
X_6 = Father's Income.

When models of the socioeconomic achievement process such as equation
(9) are estimated, the researcher is usually interested in hypotheses
about all five coefficients (excluding the intercept), \(\beta_2, \ldots,
\beta_6\). To maintain an overall protection of \(1 - \alpha = .95\)
against Type I error we can conduct simple t-tests on the five
coefficients at the .01 level (Bonferroni), or compare the usual 1 degree
of freedom F-tests (the square of the t-test) to the
\(5F^{.05}_{5,\,N-6}\) critical value (Scheffé).

Let us consider the power of the test of the education coefficient,
\(H_0: \beta_2 = 0\), when the true net return to a year of education is
$150, \(H_1: \beta_2 = 150\). Given this null hypothesis and a meaningful
alternative hypothesis, how large a sample is required to have a
reasonable likelihood of detecting an education effect of $150 a year in
samples drawn from the United States labor force? We know from the
expression for the noncentrality parameter in Table 2 that the power of
the test also depends on the variation in and covariation among the
independent variables, and the proportion of variation explained in the
model. Assuming that the five social variables explain about 15 percent
of the income variation in the United States labor force, and using
published correlations and variances of the five variables from a sample
of Wisconsin high school graduates (Sewell and Hauser, 1972), we can
calculate the noncentrality parameter as a function of sample size. From
Pearson-Hartley charts we can then plot power as a function of sample
size.^12

In Figure 2 we present the plot of power as a function of sample size
for a Type I error rate of .05 for: (1) the Bonferroni test, (2) the
Scheffé projection, and (3) a simple t-test not controlling for overall
Type I error rate (the noncentrality parameter is identical for all
three tests). We see, as noted above, that the Bonferroni test is
slightly more powerful than the Scheffé projection, and that compared to
the simple t-test one must pay a price in terms of power in order to
control for overall error rate. Thus, if one is going to test the net
education effect with a Bonferroni test, a sample size of at least 2200
observations is required to achieve a power of .90.
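
The logic behind a plot such as Figure 2 can be sketched as follows. This
is a rough illustration under stated assumptions, not a reproduction of
the figure. It takes the single-coefficient noncentrality parameter to be
n times the squared standardized effect times (1 - R_k^2) / (1 - R^2),
which is one standard way of writing it (whether it matches Table 2 term
by term is an assumption), and it uses large-sample normal and chi-square
critical values. The standardized effect of .08 and the value R_k^2 = .30
are hypothetical; only the 15 percent of variance explained comes from
the text.

```python
import math
from statistics import NormalDist

Z = NormalDist()

def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 1
    while term > 1e-16 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_crit(alpha, df):
    """Upper-alpha critical value by bisection (assumes the answer lies in [0, 200])."""
    lo, hi = 0.0, 200.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def squared_critical_values(alpha, m):
    """Squared critical values for one 1-df test when m coefficients are tested."""
    return {
        "unadjusted": Z.inv_cdf(1 - alpha / 2) ** 2,
        "bonferroni": Z.inv_cdf(1 - alpha / (2 * m)) ** 2,
        "scheffe": chi2_crit(alpha, m),  # m * F(m, infinity) at level alpha
    }

def power_1df(delta2, crit_sq):
    """Two-sided power when the statistic is approximately N(delta, 1)."""
    d, c = math.sqrt(delta2), math.sqrt(crit_sq)
    return Z.cdf(d - c) + Z.cdf(-d - c)

def delta2(n, effect, r2_model, r2_k):
    """Assumed standardized form of the single-coefficient noncentrality parameter."""
    return n * effect ** 2 * (1 - r2_k) / (1 - r2_model)

def required_n(crit_sq, effect, r2_model, r2_k, target=0.90):
    """Smallest n at which the test reaches the target power."""
    n = 2
    while power_1df(delta2(n, effect, r2_model, r2_k), crit_sq) < target:
        n += 1
    return n

crit = squared_critical_values(alpha=0.05, m=5)
sizes = {name: required_n(c, effect=0.08, r2_model=0.15, r2_k=0.30)
         for name, c in crit.items()}
for name, n in sizes.items():
    print(name, n)
```

Because the noncentrality parameter is the same for all three procedures,
the required sample sizes differ only through the critical values, which
is why the ordering of the three curves does not depend on the
hypothetical inputs.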

Does the above result imply that in national samples of more than
10,000 observations, tests on coefficients of the income determination
model will have "more than enough power"? This is only true for the
specific null and alternative hypotheses specified above. Consider a
different hypothesis on the same education coefficient. Suppose that
from a census of the population we know that the net effect of a year of
education in 1960 was $150. In 1975 we are to collect a sample in order
to detect changes in \(\beta_2\) through time, and we want to be able to
detect a change in \(\beta_2\) of $30 a year in either direction. In this
case we are testing a nonzero null hypothesis, \(H_0: \beta_2 =
\beta_2^*\) (with \(\beta_2^* = 150\)), against a nondirectional
alternative, \(H_1: |\beta_2 - \beta_2^*| = 30\). Using the same
information as in the previous example, we have determined power as a
function of sample size for this test and present the plot in Figure 3.
In order to detect a change of $30 a year with a Bonferroni test with a
power of .90, a sample of about 60,000 observations would be required.^13

The power of joint tests on coefficients is generally greater for a
given sample size. Figures 4 and 5 present the power of joint tests on
\(\beta_2\), \(\beta_3\) and \(\beta_4\) (where we have assumed that no
hypotheses concerning \(\beta_5\) and \(\beta_6\) are to be tested).
Figure 4 presents the power of the test of the joint null hypothesis
that \(\beta_2 = \beta_3 = \beta_4 = 0\) versus the alternative that they
each have standardized effects of .10. Figure 5 presents the power to
detect a joint standardized change of .02 in each coefficient. Once
again, the sample size needed to detect the joint change with a power of
.90 is relatively large, nearly 9,000 for \(\alpha = .01\).

We conclude from the above examples that it is by no means guaranteed
that tests based on large sample surveys have "more than enough" power.
While it is true that as our theories become more powerful and our models
become more precise representations of empirical processes, the increase
in the proportion of variance we can explain will unilaterally increase
the power of our statistical tests, we will also become interested in
detecting increasingly smaller effects. Indeed, in situations where we
are interested in detecting change through replications of surveys, it
is likely that we will wish to detect relatively small effects with little
or no increase in the proportion of variance explained in the replication.

Analysts of survey data are most often confronted with a situation
where the data have already been collected. Sample size and the
configuration of independent variables are given, and the researcher
wishes to test hypotheses about the parameters of a model on the given
data set. In such a situation the relevant power calculation is power as
a function of the degree to which the null hypothesis is false. From
equation (8) and Table 2 we see that the only additional information
needed to compute [...] Should the researcher find that one or more tests
are not powerful enough to detect substantively meaningful effects, two
actions are possible. The researcher can increase \(\alpha\), lowering
the protection against making a Type I error. If this is unacceptable,
the researcher must simply conclude that the data are inadequate for
testing those particular hypotheses.

In Figure 6 we present power as a function of the degree to which the
null hypothesis is false for our example of the Bonferroni test of the
education coefficient in the income model. For fixed sample sizes from
250 to 10,000 observations, the power of the test of the hypothesis,
\(H_0: \beta_2 = \beta_2^*\), is presented as a function of the absolute
magnitude of the standardized measure of the degree to which the null
hypothesis is false, \(|\tilde{\beta}_2 - \tilde{\beta}_2^*|\). With a
sample size of 250, we see that the standardized effect would need to be
as large as .30 to be detected with any regularity (power of .90), while
for a sample of 10,000 observations, an effect as small as .07 can be
detected with near certainty. To detect an effect of .15, a researcher
with a sample of 250 observations would have to conclude that the data
are inadequate, while a researcher with a sample of 10,000 would be in
danger of finding "trivial" effects statistically significant and would
perhaps decide to increase protection against Type I error
substantially.
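
When the sample is already in hand, the question can be inverted: how
large must a standardized effect be before a Bonferroni test detects it
with a given power? The sketch below is a hedged illustration rather than
the calculation behind Figure 6. It uses a large-sample normal
approximation, ignores the negligible probability of rejecting in the
wrong tail, and assumes the same standardized form of the noncentrality
parameter, n times the squared effect times (1 - R_k^2) / (1 - R^2). The
default R_k^2 = .30 is hypothetical, and the function name is
illustrative.

```python
import math
from statistics import NormalDist

def min_detectable_effect(n, m_tests, alpha=0.05, power=0.90,
                          r2_model=0.15, r2_k=0.30):
    """Smallest standardized effect a Bonferroni test detects with the given power.

    r2_model is the proportion of variance explained by the whole model;
    r2_k (a hypothetical value here) is the squared multiple correlation of
    the tested regressor with the other regressors.
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / (2 * m_tests))  # two-sided, alpha split over m tests
    needed_delta = critical + z.inv_cdf(power)       # required square root of delta^2
    return needed_delta * math.sqrt((1 - r2_model) / (n * (1 - r2_k)))

for n in (250, 1000, 10000):
    print(n, round(min_detectable_effect(n, m_tests=5), 3))
```

The detectable effect shrinks with the square root of n, so a
fortyfold increase in sample size reduces it by a factor of a little more
than six.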

A single short Fortran computer program based on equation (8) has
allowed us to compute all of the power calculations presented in this
section.^14 The logistics of these calculations are simple and could be
routinely incorporated into regression or GLM computer packages. If for
no other reason than to force investigators to decide what effects in
the population they would find substantively important, power
considerations should be incorporated into our GLM hypothesis testing
procedures. It is our hope that by incorporating power and simultaneous
inference considerations into our hypothesis testing procedures we can
narrow the gap between "statistical significance" and "substantive
significance."

V. Conclusion

Classical hypothesis testing as presented in most texts is a two-step
procedure. One chooses a level of Type I error, \(\alpha\), and compares
the test statistic to a critical value based on that \(\alpha\). After
reviewing the neglected issues of simultaneous inference and power, we
find classical hypothesis testing inadequate for the purposes of social
research. The intelligent use of statistical inference demands control
over the overall level of Type I error and knowledge of the magnitude of
effects one is likely to detect. Our examination of specific techniques
for dealing with the power and simultaneous inference problems has led
us to conclude that these techniques can be routinely incorporated into
our procedures for the statistical analysis of survey data. Therefore,
we suggest the following procedures precede the testing of GLM
hypotheses:

1. Specify the hypotheses to be tested in terms of the parameters
of the linear model.

2. Choose an acceptable Type I error rate, \(\alpha\), for the group
of hypotheses.

3. Select the appropriate test statistics, Bonferroni or
Scheffé, which provide protection against Type I error
at the \(1 - \alpha\) level.

4. Compute the noncentrality parameter, \(\delta^2\), and power of the
tests as a function of the magnitude of the effects to be
detected.

The above steps will provide information such that meaningful decisions
can be made about the hypotheses being tested. This information may be
used in survey design for the rational choice of sample size, or for
assessing the adequacy of available data for hypothesis testing.

In recent years the use of statistical inference as a criterion in
scientific decision-making has been the subject of increasing criticism
(Morrison and Henkel, 1970). Part of this criticism has been directed
at the failure of standard procedures of statistical inference to provide
the kind of information required for meaningful scientific decision-making.
In spite of this criticism standard procedures of statistical inference
continue to be employed. The continued use of the standard procedures
can, at least in part, be attributed to the lack of a viable alternative
when survey samples are analyzed. The use of our alternative procedure,
in our estimation, can contribute to a more informed use of statistical
inference in scientific decision-making. It does so by requiring that
the researcher give more attention to the goals of his research in the
use of statistical inference as a scientific decision-making aid. The
procedure we suggest requires that the researcher consider the purpose
of the research in the selection of a meaningful unit of error rate. It
also requires that the researcher give attention to the size of effect
believed to be substantively significant in judging the adequacy of a
given sample for decision-making purposes.

^1 The choice of sample size in a survey design is always subject to
cost constraints, and it is simply impossible to take into consideration
the impact of sample size on all hypotheses which will be subsequently
tested on the data. With multivariate sampling, it is impossible to
fix a priori the variation in each independent variable and the
covariation among the independent variables. Thus, even those analysts
fortunate enough to be involved in survey design have only limited
control over the power of their subsequent tests.

^2 In models in which an intercept is specified, the first column of
X is a vector of ones, (1 1 ... 1)', and the first element of the
\(\beta\) vector is the intercept parameter. All models considered in
this paper will have intercepts specified. Thus K - 1 rather than K is
the number of independent variables.

^3 This requirement is stronger than that of uncorrelatedness of
\(\varepsilon\) and X; it is a weaker assumption than statistical
independence of \(\varepsilon\) and X.

^4 The null distributions of GLM test statistics do not depend upon
the configuration of the X matrix, and consequently the conditional and
unconditional Type I error rates are equivalent. The nonnull
distributions of the GLM test statistics do depend upon the X matrix
configuration (see the expression for the noncentrality parameter
presented below). By ignoring sampling variability in the X matrix, a
source of variability in the nonnull distribution of the test statistics
is being ignored. Consequently, the unconditional probability of Type II
error is underestimated. Unfortunately, the unconditional nonnull test
statistic distribution theory is quite complex, and incorporating it
into our presentation would take us out of the context of the classical
general linear model.

^5 The matrix expression for the t-test of hypothesis (3) is merely
\(a'b\) divided by the standard error of the linear combination \(a'b\),
where the standard error is \( s\,[a'(X'X)^{-1}a]^{1/2} \).

^6 Miller (1966) provides an exhaustive treatment of the statistical
bases and applications of the many techniques of simultaneous
statistical inference. A less exhaustive and more applications-oriented
review is provided by Kirk (1968).

^7 The case of "theory trimming" described here is qualitatively
different from the situation of testing an a priori hypothesized model.
A number of procedures have been proposed for testing the fit of an a
priori model where certain structural coefficients are hypothesized to
be equal to zero (Land, 1973; McPherson and Huang, 1974; Specht, 1975).
McPherson and Huang (1974) present an equation-by-equation scheme for
testing the fit of an hypothesized recursive structural equation model
that explicitly incorporates simultaneous inference considerations. If
a single test of the global fit of an hypothesized model is performed,
then one is effectively removed from the simultaneous inference case.
If a comparison of the fit of several models is performed (cf. Specht,
1975), or if an attempt is made to diagnose what specific structural
parameters are responsible for the failure of an hypothesized model to
hold, then considerations raised in our discussion of simultaneous
inference issues again become relevant.

^8 There are, of course, other techniques that are applicable to the
common hypothesis tests on slope coefficients. For example, Williams
(1972) discusses the application of Tukey's technique for making
pairwise multiple comparisons of means within the regression framework.
We restrict our attention to the Bonferroni and Scheffé techniques
because of their wide generality and ease of application.

^9 Other factors such as departures from random sampling and
measurement error affect both Type I and Type II error rates. Again, we
have slighted important issues in order to remain within the context of
the classical general linear model as it is most often applied in
research by sociologists. Our point is not primarily that approximate
error rate calculations are better than none. A more fundamental point
is that the conceptualization of the appropriate unit of error rate and
of meaningful effects to be detected will enhance our understanding of
what our statistical analyses of survey data can and cannot tell us.

^10 Power tables approach an asymptote at about N - K = 100 denominator
degrees of freedom. Since we are concerned with survey samples
with generally many more than 100 observations, our calculations are
based on tabled power for "infinite" denominator degrees of freedom.

^11 It must be assumed here that the hypotheses being tested are
with respect to the unstandardized parameters and that the standardized
parameters are merely an arbitrary rescaling of their unstandardized
counterparts. The GLM distribution theory does not apply to the direct
estimation of standardized parameters. The distributions of the
standardized estimates and test statistics can become quite complex.
The application of such distributions to direct statistical inference
with respect to standardized parameters is virtually nonexistent in the
social survey literature.

^12 An intermediate step is required to use these charts. They
are presented in terms of a parameter \(\phi\), where
\(\phi = \sqrt{\delta^2 / (J + 1)}\) and J is the numerator degrees of
freedom of the test. For Scheffé projections, J is the numerator degrees
of freedom from the preliminary joint test.

^13 Note also that our value for \(\beta_2\) in 1960 was assumed to be
based on a census and therefore not subject to sampling variability. If
this were not the case, the power curves would be still lower.

^14 While the calculations are simple, the intermediate step of
calculating the \(\phi\) parameter and finding power in the
Pearson-Hartley charts can be annoying. It would be much more convenient
if charts were available for power as a function of \(\delta^2\) directly
(as in Figure 1) for a number of \(\alpha\) levels and numerator degrees
of freedom. The Pearson-Hartley charts are further limited in that they
have been tabulated only for \(\alpha\) = .01, .05, and .10.